MS5130_3rd_Assignment_23105421

Author: Anup Kamath

Introduction

Greetings, this document has been crafted to satisfy the criteria of the MS5130 course, Applied Analytics in Business and Society. It encompasses a range of analyses including quantitative analysis, visualization utilizing the leaflet package, interactive graphs, and fundamental visualizations.

Part A

In Part I, I will showcase some of the basic enhancements (BEs) I have implemented. These enhancements primarily involve utilizing Markdown, Quarto, dataset reading, preprocessing, merging, and conducting quantitative analysis

Importing files

The initial steps involve establishing the directory and importing necessary files. The data-sets essential for the analysis were sourced from Words (2021) . The data-set contains information on various parameters such as life expectancy, fertility rate, corruption, etc., pertaining to different countries.This will full-fill our BE2- Use multiple data-sets task.

Code
print("the current directory is")
[1] "the current directory is"
Code
getwd()
[1] "C:/Users/ANUP KAMATH/Documents/GitHub/R-Assignment"
Code
print("Lets set the directory as per our requirement")
[1] "Lets set the directory as per our requirement"
Code
setwd("C:/Users/ANUP KAMATH/Documents/GitHub/R-Assignment")
print("The new directory is")
[1] "The new directory is"
Code
getwd()
[1] "C:/Users/ANUP KAMATH/Documents/GitHub/R-Assignment"
Code
#importing 1st file
file1<-read.csv(file="C:/Users/ANUP KAMATH/Desktop/children-born-per-woman.csv")
print("The first few rows of 1st file are")
[1] "The first few rows of 1st file are"
Code
head(file1)
       Entity Code Year Fertility.rate..Gapminder..v12..2017.
1 Afghanistan  AFG 1950                                  7.57
2 Afghanistan  AFG 1951                                  7.56
3 Afghanistan  AFG 1952                                  7.55
4 Afghanistan  AFG 1953                                  7.54
5 Afghanistan  AFG 1954                                  7.53
6 Afghanistan  AFG 1955                                  7.52
Code
#importing 2nd file
file2<-read.csv(file="C:/Users/ANUP KAMATH/Desktop/world-happiness-report.csv")
print("The first few rows of 2nd file are")
[1] "The first few rows of 2nd file are"
Code
head(file2)
  Country.name year Life.Ladder Log.GDP.per.capita Social.support
1  Afghanistan 2008       3.724              7.370          0.451
2  Afghanistan 2009       4.402              7.540          0.552
3  Afghanistan 2010       4.758              7.647          0.539
4  Afghanistan 2011       3.832              7.620          0.521
5  Afghanistan 2012       3.783              7.705          0.521
6  Afghanistan 2013       3.572              7.725          0.484
  Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
1                            50.80                        0.718      0.168
2                            51.20                        0.679      0.190
3                            51.60                        0.600      0.121
4                            51.92                        0.496      0.162
5                            52.24                        0.531      0.236
6                            52.56                        0.578      0.061
  Perceptions.of.corruption Positive.affect Negative.affect
1                     0.882           0.518           0.258
2                     0.850           0.584           0.237
3                     0.707           0.618           0.275
4                     0.731           0.611           0.267
5                     0.776           0.710           0.268
6                     0.823           0.621           0.273
Code
#importing 3rd file
file3<-read.csv(file="C:/Users/ANUP KAMATH/Desktop/population_by_country_2020.csv")
print("The first few rows of 3rd file are")
[1] "The first few rows of 3rd file are"
Code
head(file3)
  Country..or.dependency. Population..2020. Yearly.Change Net.Change
1                   China        1440297825        0.39 %    5540090
2                   India        1382345085        0.99 %   13586631
3           United States         331341050        0.59 %    1937734
4               Indonesia         274021604        1.07 %    2898047
5                Pakistan         221612785        2.00 %    4327022
6                  Brazil         212821986        0.72 %    1509890
  Density..P.Km.. Land.Area..Km.. Migrants..net. Fert..Rate Med..Age
1             153         9388211        -348399        1.7       38
2             464         2973190        -532687        2.2       28
3              36         9147420         954806        1.8       38
4             151         1811570         -98955        2.3       30
5             287          770880        -233379        3.6       23
6              25         8358140          21200        1.7       33
  Urban.Pop.. World.Share
1        61 %     18.47 %
2        35 %     17.70 %
3        83 %      4.25 %
4        56 %      3.51 %
5        35 %      2.83 %
6        88 %      2.73 %

Preprosseing dataset

Now, let’s preprocess the data to facilitate easy joining. We have selected only the columns necessary for our analysis, and the rest have been dropped. We have only retained data with respect to year 2020. The names of the columns has been changed.

Code
#file 1 changes
subset_a_2020 <- file1[file1$Year == 2020, ]
#head(subset_a_2020)

subset_a_2020 <- subset_a_2020[, c("Entity", "Code", 
                                   "Fertility.rate..Gapminder..v12..2017.")]
names(subset_a_2020) <- c("country", "code", "fertility_rate")
print("The new modified file 1 data-setlooks like this")
[1] "The new modified file 1 data-setlooks like this"
Code
head(subset_a_2020)
                country code fertility_rate
71          Afghanistan  AFG           4.04
162             Albania  ALB           1.70
235             Algeria  DZA           2.54
308              Angola  AGO           5.41
381 Antigua and Barbuda  ATG           2.01
535           Argentina  ARG           2.23
Code
#file 2 changes
subset_b_2020 <- file2[file2$year == 2020, ]

# View the subset
#head(subset_b_2020)
subset_b_2020 <- subset_b_2020[, c("Country.name", "Healthy.life.expectancy.at.birth", 
                                   "Freedom.to.make.life.choices", "Generosity", 
                                   "Perceptions.of.corruption")]
names(subset_b_2020) <- c("country", "life_expectancy", "freedom","generosity","corruption")

# View the modified subset
print("The modified file 2 data-setlooks like this")
[1] "The modified file 2 data-setlooks like this"
Code
head(subset_b_2020)
       country life_expectancy freedom generosity corruption
25     Albania            69.3   0.754      0.007      0.891
52   Argentina            69.2   0.823     -0.122      0.816
80   Australia            74.2   0.905      0.210      0.491
93     Austria            73.6   0.912      0.011      0.464
118    Bahrain            69.7   0.945      0.132         NA
133 Bangladesh            65.3   0.777     -0.009      0.742
Code
#file 3 changes
#head(file3)
#colnames(file3)
subset_c_2020 <- file3[, c("Country..or.dependency.", "Population..2020.", 
                           "Density..P.Km..")]

names(subset_c_2020)<- c("country", "population", "density_pkm")
print("The modified file 3 data-setlooks like this")
[1] "The modified file 3 data-setlooks like this"
Code
head(subset_c_2020)
        country population density_pkm
1         China 1440297825         153
2         India 1382345085         464
3 United States  331341050          36
4     Indonesia  274021604         151
5      Pakistan  221612785         287
6        Brazil  212821986          25

Joining Datasets

It’s time to join the 3 data-sets into 1. I have used dplyr libraries inner join to join them. As R is case sensitive I had to change all country names to lower caps to succed in this task.This will full-fill our BE3- Combine data-sets together task.

.

Code
#joining tables
library(dplyr)

# Convert country column to lowercase in all dataframes
subset_a_2020 <- subset_a_2020 %>% mutate(country = tolower(country))
subset_b_2020 <- subset_b_2020 %>% mutate(country = tolower(country))
subset_c_2020 <- subset_c_2020 %>% mutate(country = tolower(country))


#join data
country_data <- subset_b_2020 %>%
  inner_join(subset_a_2020, by = "country") %>%
  inner_join(subset_c_2020, by = "country")

#show new dataset
print("The new data-set looks like this")
[1] "The new data-set looks like this"
Code
head(country_data)
     country life_expectancy freedom generosity corruption code fertility_rate
1    albania            69.3   0.754      0.007      0.891  ALB           1.70
2  argentina            69.2   0.823     -0.122      0.816  ARG           2.23
3  australia            74.2   0.905      0.210      0.491  AUS           1.81
4    austria            73.6   0.912      0.011      0.464  AUT           1.54
5    bahrain            69.7   0.945      0.132         NA  BHR           1.94
6 bangladesh            65.3   0.777     -0.009      0.742  BGD           2.00
  population density_pkm
1    2877239         105
2   45267449          17
3   25550683           3
4    9015361         109
5    1711057        2239
6  164972348        1265

Quantitative Analysis

Due to the inadequacy of my dataset in producing satisfactory models, I opted to convert my target variable, corruption, into a factor Overflow (2012) . With this new column, I proceeded to run GLM and GAM models. As my datasets do not allow for qualitative models like text mining, I focused on satisfying the deliverable (BE4) by incorporating a synergy of quantitative and qualitative analyses through GLM and GAM models.

GLM model

A Generalized Linear Model (GLM) is a statistical framework that allows for the analysis of data with diverse response distributions.Here I have used the corruption_factor as a tagert variable and freedom column as the independent variable.

Code
#quantitative analysis- glm model
library(dplyr)


# Converting corruption into factor inorder to run glm model
country_data <- country_data %>%
  mutate(corruption_factor = ifelse(corruption > 0.6, 1, 0))

# Convert 'corruption_factor' to a factor variable
country_data$corruption_factor <- factor(country_data$corruption_factor)

# running glm model
set.seed(100)

# Split data into training and testing sets
sample_indices <- sample(1:nrow(country_data), 70)
test_data <- country_data[sample_indices, ]
train_data <- country_data[-sample_indices, ]

# Fit the logistic regression model
glm_model <- glm(corruption_factor ~ freedom, family = binomial(link = "logit"), data = train_data)

# Summary of the model
summary(glm_model)

Call:
glm(formula = corruption_factor ~ freedom, family = binomial(link = "logit"), 
    data = train_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    7.448      5.508   1.352    0.176
freedom       -7.542      6.432  -1.173    0.241

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18.550  on 16  degrees of freedom
Residual deviance: 16.935  on 15  degrees of freedom
  (2 observations deleted due to missingness)
AIC: 20.935

Number of Fisher Scoring iterations: 4
Code
# Plot the model
plot(glm_model)

The Fisher Scoring algorithm converged in 4 iterations. The obtained p-value of 0.176 suggests insignificance, indicating no significant relationship between corruption and freedom. The null deviance (18.550) and residual deviance (16.593) represent the deviation of the fitted model, with lower values indicating better model fit.

GAM model

A Generalized Additive Model (GAM) is a flexible extension of the Generalized Linear Model (GLM) that incorporates non-linear relationships between predictors and the response variable. The GAM model was built with corruption as the target variable, freedom and generosity were the independent variable.

Code
#gam model
#install.packages("gam")
library(mgcv)

# Fit a GAM model
gam_model <- gam(corruption ~ s(freedom ) + s(generosity), data = country_data)

# Summary of the model
summary(gam_model)

Family: gaussian 
Link function: identity 

Formula:
corruption ~ s(freedom) + s(generosity)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.70340    0.01923   36.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df      F  p-value    
s(freedom)    2.23  2.803 12.819 2.96e-06 ***
s(generosity) 1.00  1.000  0.024    0.877    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.31   Deviance explained = 33.9%
GCV = 0.030496  Scale est. = 0.028842  n = 78
Code
plot(gam_model)

The adjusted R-squared is 0.31, indicating that 31% of the variation in corruption is explained by the model. The deviance explained is 33.9%. The smooth term for freedom has an effective degrees of freedom (edf) of 2.23 and a highly significant F-statistic (p < 2.96e-06), suggesting a non-linear relationship with corruption. However, the smooth term for generosity is not significant (p = 0.877), implying a linear relationship.

Part B

In this sections I will be performing tasks for superior enhancements (SEs)

Mermaid design

The mermaid tool is used to describe the flow of data and also varibles used to do varies analysis and visualization.This task will fulfill the (SE1) Depict your data streams using Mermaid.

graph LR
     A[children-born-per-woman]-- import data --> D[subset_a_2020]
    B[world-happiness-report]-- import data --> E[subset_b_2020]
    C[population_by_country_2020]-- import data  --> F[subset_c_2020]
    D[subset_a_2020]-- Preprocess data --> M((country_data))
    E[subset_b_2020]-- Preprocess data  --> M((country_data))
    F[subset_c_2020]-- Preprocess data --> M((country_data))
    
   M((country_data))-- GLM model-->G[GLM Model]
   G[GLM Model]-- variables-->H[corruption & Freedom]
    M((country_data))-- GAM model-->I[GAM Model]
   I[GAM Model]-- variables-->J[corruption,generosity & Freedom]

    M((country_data))-- leaflet-->K[ Geographic map]--variables-->WY[Population,Corruption]
    B[world-happiness-report]-- interactive graph --> L[ Scattered plot]
    M((country_data))-->N[Bar plot]--variables-->W[Population]
    M((country_data))-->Y[Bar plot]--variables-->S[Life expectancy]
    M((country_data))-->O[Scattere Plot]--variables-->X[Fertility Rate vs. Life Expectancy]
    M((country_data))-->U[Box Plot]--variables-->R[corruption,generosity,fertility rate,life expectancy]

Leaflet Analysis

Using Leaflet library we can map various parameters on the world/area maps.This task will fulfill the (SE3) Use of geographical data analysis using Leaflet.

Adding Latitude and Longitude

While doing the analysis I was not able to map parameters on the world map as latitudes and logitude werentpresent in my dataset. I used Generative AI to add these columns to my dataset OpenAI (2024)

Code
#install required packages
#install.packages("readr")
#install.packages("dplyr")
#install.packages("tidyverse")
library(readr)
library(dplyr)
library(tidyverse)
library(mgcv)

#if (!require("tm")) install.packages("tm")
#if (!require("wordcloud")) install.packages("wordcloud")
#if (!require("RColorBrewer")) install.packages("RColorBrewer")

library(tm)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(tidyr)

if (!require("leaflet")) install.packages("leaflet")
if (!require("countrycode")) install.packages("countrycode")

library(leaflet)
library(countrycode)

#install.packages("plotly")
library(plotly)


#head(country_data)
library(maps)
library(ggplot2)

world_map <- map_data("world")

# Convert country names in anuata to lowercase
country_data$country <- tolower(country_data$country)

# Convert region names in world_map (which represents countries here) to lowercase
world_map$region <- tolower(world_map$region)

# unique list of countries with their mean latitude and longitude
country_coords <- world_map %>%
  group_by(region) %>%
  summarize(lat = mean(lat), lon = mean(long), .groups = 'drop')


# Merge the coordinates
country_data <- merge(country_data, country_coords, by.x = "country", by.y = "region", all.x = TRUE)

non_matching <- country_data[is.na(country_data$lat), "country"]
unique(non_matching)
[1] "united kingdom" "united states" 
Code
#US and UK latitude and logitude missing

#adding lat & long for US
country_data$lat <- ifelse(country_data$country == "united states", 37.0902, country_data$lat)
country_data$lon <- ifelse(country_data$country == "united states", -95.7129, country_data$lon)

# adding lat & long for Uk
country_data$lat <- ifelse(country_data$country == "united kingdom", 55.3781, country_data$lat)
country_data$lon <- ifelse(country_data$country == "united kingdom", -3.4360, country_data$lon)

#tail(country_data)

Geographical Analysis

The Map visualisation using Leaflet library

Code
library(leaflet)

world_map_leaflet <- leaflet(data = country_data) %>%
  addTiles() %>%
  setView(lng = 0, lat = 30, zoom = 2)  


world_map_leaflet <- world_map_leaflet %>%
  addMarkers(
    lng = ~lon,  
    lat = ~lat,  
    popup = ~paste0("<b>Country:</b> ", country, "<br>",
                    "<b>Population:</b> ", population, "<br>",
                    "<b>Corruption:</b> ", corruption)  
  )

# Display the map
world_map_leaflet

Interactive plot

Interactive plot allows you move around in the final output file, like pop-up zoom in zoom out etc. This will full-fill the SE4- Use of interactive charts/graphs/plots task R Graph Gallery (Accessed 2024)

Code
#creating a subset 
selected_countries <- c("Afghanistan", "United States", "China", "Ireland", "India")

# Create a subset where the "Country.name" is in the selected countries list
subset_file2 <- subset(file2, Country.name %in% selected_countries)

# View the subset
#print(subset_file2)


#creating interactive graph
library(plotly)
library(dplyr)
library(ggplot2)
# Assuming 'subset_file2' is your dataframe

# Create Plotly plot
fig <- plot_ly(subset_file2, x = ~year, y = ~Healthy.life.expectancy.at.birth, 
               color = ~Country.name, text = ~Country.name) %>%
  add_markers() %>%
  layout(title = "Healthy Life Expectancy at Birth Over Years",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Healthy Life Expectancy at Birth"),
         hovermode = "closest")


# Save the interactive plot to an HTML file
htmlwidgets::saveWidget(fig, "fig.html")

Part C

In this section, I will be showing some visualisation, which will give us some valuable insights.

Bar Plot-Population of Top 5 Countries

Top 5 countries based on population

Code
library(ggplot2)

# Subset 
top_5_populated <- head(country_data[order(-country_data$population), ], 5)

# Create the bar plot
ggplot(data = top_5_populated, aes(x = country, y = population, fill = country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = population), vjust = -0.5, size = 3) +  
  labs(x = "Country", y = "Population", title = "Population of Top 5 Countries") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Bar Plot-Life Expectancy of Top 5 populated Countries

Life expectancy of the top 5 populated countries

Code
ggplot(data = top_5_populated, aes(x = country, y = life_expectancy, fill = country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = life_expectancy), vjust = -0.5, size = 3) +  
  labs(x = "Country", y = "Life Expectancy", title = "Life Expectancy of Top 5 populated Countries") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Scattered Plot-Fertility Rate vs. Life Expectancy for Selected countries

Scatter plot dpeciting how 20 most populated countries perform with respect to fertility rate and life expectancy.

Code
library(ggplot2)

top_20_populated <- head(country_data[order(-country_data$population), ], 20)  

ggplot(top_20_populated, aes(x = fertility_rate, y = life_expectancy, size = population, color = country)) +
  geom_point(alpha = 0.7) +
  geom_text(aes(label = country), vjust = -0.5, hjust = 0.5, size = 3) +  # Add country labels
  scale_size_continuous(range = c(1, 10)) +
  labs(title = "Fertility Rate vs. Life Expectancy for Selected countries",
       x = "Fertility Rate",
       y = "Life Expectancy") +
  theme_minimal()

Box plots

Box plot for various paramters to know median value and also spread.

Code
library(ggplot2)
library(cowplot)

# Box plot for life expectancy
plot1 <- ggplot(country_data, aes(y = life_expectancy)) +
  geom_boxplot(color = "blue", fill = "lightblue") +
  labs(title = "Box Plot of Life Expectancy")


# Box plot for generosity
plot2 <- ggplot(country_data, aes(y = generosity)) +
  geom_boxplot(color = "red", fill = "pink") +
  labs(title = "Box Plot of Generosity")

# Box plot for corruption
plot3 <- ggplot(country_data, aes(y = corruption)) +
  geom_boxplot(color = "purple", fill = "lavender") +
  labs(title = "Box Plot of Corruption")

# Box plot for fertility rate
plot4 <- ggplot(country_data, aes(y = fertility_rate)) +
  geom_boxplot(color = "orange", fill = "peachpuff") +
  labs(title = "Box Plot of Fertility Rate")

# Arrange plots in one image
plot_grid(plot1, plot2, plot3, plot4, ncol = 2)

Part D-Notes

    • Due to limitation with respect to data-set,qualitative analysis couldn’t be performed.

    • One of the major limitation of this data-set is it has only 89 columns after joining, hence the model may not be accurate.

    • BE1-executing R code in Quarto, BE5-explanatory - has been done throughout the assignment

    • The all BE tasks has been performed

    • SE2- Use of private Github repository - This has been created recently and has been shared with the Professor.

    • Thus SE1,SE2,SE3,SE4 has been completed.

    • The video and reference document has been uploaded on canvas

    • Overflow (2012) was used to do basic changes to quarto documents

    • Author(s) or Channel Name (Year Published) was used to add bibliography to this document

  • Thank you!

References

Author(s) or Channel Name. Year Published. “Title of the Video.” https://www.youtube.com/watch?v=dbljY7jxrSA.
OpenAI. 2024. “OpenAI ChatGPT.” Online. https://chat.openai.com/.
Overflow, Stack. 2012. “Convert Data Frame Column Format from Character to Factor.” https://stackoverflow.com/questions/9251326/convert-data-frame-column-format-from-character-to-factor.
R Graph Gallery. Accessed 2024. “Interactive Charts in r.” https://r-graph-gallery.com/interactive-charts.html.
Words, Joshua’s. 2021. “Awesome EDA 2021: Happiness & Population.” Kaggle notebook. https://www.kaggle.com/code/joshuaswords/awesome-eda-2021-happiness-population/notebook.